Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: arrow_path_to_parquet #1239

Merged
merged 10 commits into from
Mar 24, 2025
Merged

test: arrow_path_to_parquet #1239

merged 10 commits into from
Mar 24, 2025

Conversation

de-sh
Copy link
Contributor

@de-sh de-sh commented Mar 14, 2025

Fixes #XXXX.

Description

Adds tests to validate working of the arrow_path_to_parquet function


This PR has:

  • been tested to ensure log ingestion and log query works.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added documentation for new or modified features or behaviors.

Summary by CodeRabbit

Summary by CodeRabbit

  • New Features
    • Enhanced conversion from arrow file paths to parquet file paths, including improved error handling.
    • Added logging for unexpected filename structures to strengthen file handling.
  • Bug Fixes
    • Updated file extension filtering logic to utilize a constant value for better maintainability.
  • Tests
    • New test cases added to validate the file path conversion functionality across various scenarios and edge cases.

Copy link

coderabbitai bot commented Mar 14, 2025

Walkthrough

The changes introduce a new implementation of the arrow_path_to_parquet function, which converts valid arrow file paths into parquet file paths. This function checks for a delimiter in the filename and constructs a parquet filename using a specified random string, returning an Option<PathBuf>. The previous implementation has been removed due to its lack of error handling. Additionally, the filtering logic in the arrow_files method has been updated to use a constant ARROW_FILE_EXTENSION, and the arrow_files_grouped_exclude_time method now logs warnings for unexpected arrow files.

Changes

File(s) Summary
src/parseable/streams.rs - Removed the old implementation of arrow_path_to_parquet.
- Added new function fn arrow_path_to_parquet(path: &Path, random_string: &str) -> Option<PathBuf> for converting arrow file paths to parquet file paths.
- Updated arrow_files to use ARROW_FILE_EXTENSION for file extension checks.
- Modified arrow_files_grouped_exclude_time to log warnings for unexpected arrow files.
- Added new test cases for arrow_path_to_parquet.

Possibly related PRs

  • log: track conversion times #1241: The changes in the main PR, which involve a new implementation of the arrow_path_to_parquet function, are related to the retrieved PR as both involve the conversion processes between arrow and parquet formats, specifically within the same file, src/parseable/streams.rs.
  • fix: ensure panic safety #1212: The changes in the main PR are related to the modifications in error handling and the use of the ARROW_FILE_EXTENSION constant in the retrieved PR, specifically in the context of the src/parseable/streams.rs file.
  • fix: capture time created from metadata not filename #1211: The changes in the main PR introduce a new implementation of the arrow_path_to_parquet function and modify the arrow_files and arrow_files_grouped_exclude_time methods, while the retrieved PR focuses on updating the arrow_files_grouped_exclude_time method's parameter type and logic, indicating a direct relationship in their modifications to the same method.

Suggested labels

for next release

Suggested reviewers

  • nikhilsinhaparseable

Poem

I'm a little rabbit, hopping through the code,
With new paths to parquet, my excitement's bestowed.
Warnings now echo, when files go astray,
In the garden of logic, we dance and we play.
With each little change, our project will grow,
Hopping through code, with a joyful glow!
🐇✨

✨ Finishing Touches
  • 📝 Generate Docstrings

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/parseable/streams.rs (1)

94-106: Improve error handling in arrow_path_to_parquet function.

The function correctly extracts the front part from the arrow file name using the regex pattern, but it uses unwrap() twice on line 96, which could panic if the path has no filename or the filename is not valid UTF-8.

Consider improving the error handling:

 fn arrow_path_to_parquet(path: &Path, random_string: &str) -> Option<PathBuf> {
-    let filename = path.file_name().unwrap().to_str().unwrap();
+    let filename = path.file_name()?.to_str()?;
     let filename = ARROWS_NAME_STRUCTURE
         .captures(filename)
         .and_then(|c| c.get(1))?
         .as_str();
     let filename_with_random_number = format!("{filename}.data.{random_string}.parquet");
     let mut parquet_path = path.to_owned();
     parquet_path.set_file_name(filename_with_random_number);

     Some(parquet_path)
 }

This makes the function more robust by using the ? operator to propagate the None value if either operation fails.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between db4a68d and 41b3688.

📒 Files selected for processing (1)
  • src/parseable/streams.rs (5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: coverage
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default aarch64-apple-darwin
🔇 Additional comments (4)
src/parseable/streams.rs (4)

36-45: Good addition of necessary imports.

The imports for once_cell::sync::Lazy and regex::Regex are appropriate for implementing the regex-based file name validation solution.


224-231: Improved arrow file filtering using regex.

The updated filtering logic using the regex pattern provides more robust validation of arrow filenames compared to the previous simple extension check.


274-281: Good error handling with warning for unexpected arrow files.

Adding a warning log when an arrow file doesn't match the expected pattern is a good practice. It will help identify problematic files during operation.


73-92:

❓ Verification inconclusive

Well-documented regex pattern for arrow file name validation.

The static regex pattern with detailed documentation explains the expected format, limitations, and gives examples, which is excellent for maintainability.

One consideration: the pattern uses \S+ for the front part which allows any non-whitespace character, but the documentation mentions limitations on partition keys/values. Consider if the regex should be more restrictive to enforce the documented limitations.


🏁 Script executed:

#!/bin/bash
# Check if there are any existing arrow files that might not match the new pattern
# This will help identify if the regex might be too restrictive for existing files

# Find current arrow files
echo "Searching for existing arrow files..."
find . -name "*.arrows" | grep -v "node_modules" | head -n 10

# Test a few example filenames against the regex
echo -e "\nTesting regex against sample filenames:"
test_regex() {
  if [[ "$1" =~ ^[a-zA-Z0-9&=]+\.(.*?)\.data\.arrows$ ]]; then
    echo "MATCH: $1"
  else
    echo "NO MATCH: $1"
  fi
}

# Valid examples
test_regex "abc123.date=2023-01-21.hour=10.minute=30.key1=value1.key2=value2.ee529ffc8e76.data.arrows"
test_regex "20200201T1830f8a5fc1edc567d56&key1=value1&key2=value2.date=2020-01-21.hour=10.minute=30.key1=value1.key2=value2.ee529ffc8e76.data.arrows"

# Invalid examples
test_regex "abc-123.date=2023-01-21.hour=10.minute=30.key1=value1.key2=value2.ee529ffc8e76.data.arrows" # Invalid character in schema_key
test_regex "abc123.date=2023-01-21.hour=10.minute=30.key1=value-1.key2=value2.ee529ffc8e76.data.arrows" # Has hyphen in value

Length of output: 374


Attention: Refine Regex Restrictions & Address Shell Test Syntax

The documentation for the arrow file regex is clear and detailed, which is excellent for maintainability. However, two points need your attention:

  • Regex Specificity: The regex currently uses \S+ to capture the "front" part, which permits any non-whitespace characters. This does not enforce the documented limitations that partition keys and values must be strictly alphanumeric. Please consider updating the regex to use a more precise character class that only allows alphanumeric characters (and specific allowed punctuation) as described.

  • Shell Test Script Issue: The test script encountered a syntax error due to the unescaped ampersand (&) within the regex. When running regex tests in bash, special characters like & should be properly escaped or quoted to prevent shell misinterpretation. This adjustment will ensure reliable test execution.

Please address these items to align the implementation with the documentation and improve test reliability.

coderabbitai[bot]
coderabbitai bot previously approved these changes Mar 14, 2025
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 41b3688 and 2995cac.

📒 Files selected for processing (1)
  • src/parseable/streams.rs (6 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: coverage
🔇 Additional comments (5)
src/parseable/streams.rs (5)

36-36: Appropriate imports added.

The imports for once_cell::sync::Lazy and regex::Regex are correctly added to support the new regex-based file name extraction functionality.

Also applies to: 45-45


73-92: Good documentation and regex pattern implementation.

The documentation thoroughly explains the format, limitations, and provides examples for the arrow file name structure. Using Lazy::new() for initializing the static regex is a good practice that ensures the regex is compiled only once when needed.

The regex pattern looks correct for validating the described format:

  • Schema key (alphanumeric with & and = characters)
  • Front part (for parquet file naming)
  • Suffix .data.arrows

227-231: Improved file filtering with regex validation.

Replacing the simple extension check with regex validation improves the robustness of the arrow_files method by ensuring that files follow the expected naming pattern.


274-281: Added warning for unexpected arrow files.

Good improvement to log warnings for arrow files that don't match the expected pattern. This will help with debugging file naming issues.


1234-1296: Comprehensive test coverage for the new functionality.

The tests cover a good range of scenarios including:

  • Valid arrow path conversion
  • Invalid arrow paths (missing suffix)
  • Invalid schema keys with special characters
  • Complex nested paths
  • Edge cases with empty front part

These tests will help ensure the robustness of the implementation.

coderabbitai[bot]
coderabbitai bot previously approved these changes Mar 14, 2025
coderabbitai[bot]
coderabbitai bot previously approved these changes Mar 14, 2025
de-sh added a commit to de-sh/parseable that referenced this pull request Mar 16, 2025
@de-sh de-sh changed the title refactor: use regex to extract info from arrow file name test: arrow_path_to_parquet Mar 21, 2025
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e87d96f and 684b8b1.

📒 Files selected for processing (1)
  • src/parseable/streams.rs (5 hunks)
🔇 Additional comments (3)
src/parseable/streams.rs (3)

198-200: Improved file filtering approach

Good use of the ARROW_FILE_EXTENSION constant and the more concise is_some_and method for filtering. This makes the code more maintainable and less prone to errors from hardcoded strings.


243-249: Better error handling for unexpected arrow files

The new implementation properly handles cases where arrow_path_to_parquet returns None, adding helpful warning logs. This improves robustness and debuggability when encountering unexpected file formats.


1196-1284:

✅ Verification successful

Comprehensive test coverage for the new functionality

The tests thoroughly cover various scenarios including valid paths, invalid paths, complex directory structures, and edge cases. This helps ensure the reliability of the new implementation.

Verify filename format consistency with these tests:


🏁 Script executed:

#!/bin/bash
# Check if there are any existing arrow filename patterns in the codebase

# Look for regex patterns that might already be defined for arrow filenames
echo "Checking for existing regex patterns for arrow filenames:"
rg -A 2 "Regex::new|lazy_static.*Regex|regex!|static.*Regex" --type rust

# Look for other string operations on arrow filenames
echo "Checking for other string operations on arrow filenames:"
rg -A 2 "\.split|\.parse|\.contains.*arrow" --type rust

Length of output: 15295


Approval: Arrow Filename Conversion Tests Provide Comprehensive Coverage and Maintain Consistent Format

The test suite in src/parseable/streams.rs thoroughly validates the arrow file conversion logic by:

  • Confirming that valid arrow filenames are correctly transformed into parquet filenames using the supplied random string.
  • Ensuring that edge cases—such as invalid paths, schema keys with disallowed characters, empty front parts, and files within nested directories—are properly handled.
  • Verifying that the filename format in the conversion function (i.e., splitting on the first dot and appending the new suffix) is consistent with the behavior expected by the tests.

Additionally, our search of the codebase reveals no conflicting regex patterns or alternative string manipulation methods for arrow filenames. This confirms that the current implementation is consistent and in line with the repository’s overall filename handling patterns.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
src/parseable/streams.rs (1)

70-80: ⚠️ Potential issue

Mismatch between PR objective and implementation

The PR mentions using regex for processing complex filenames, but the function arrow_path_to_parquet still uses string operations (split_once) instead of regular expressions. Additionally, the assertion on line 74 assumes the front part will contain a delimiter, which could cause a panic if not present.

Consider implementing the regex approach as mentioned in the PR objectives:

/// Returns the filename for parquet if provided arrows file path is valid as per our expectation
fn arrow_path_to_parquet(path: &Path, random_string: &str) -> Option<PathBuf> {
    let filename = path.file_stem()?.to_str()?;
-   let (_, front) = filename.split_once('.')?;
-   assert!(front.contains('.'), "contains the delim `.`");
+   // Use regex to extract parts after the schema key
+   let re = Regex::new(r"^[^.]+\.(.+)$").ok()?;
+   let front = re.captures(filename)?.get(1)?.as_str();
    let filename_with_random_number = format!("{front}.data.{random_string}.parquet");
    let mut parquet_path = path.to_owned();
    parquet_path.set_file_name(filename_with_random_number);

    Some(parquet_path)
}
🧹 Nitpick comments (1)
src/parseable/streams.rs (1)

1196-1284: Comprehensive test coverage for the new functionality

The tests thoroughly cover various scenarios including valid paths, invalid paths, complex paths, and edge cases. This will help maintain the functionality over time.

One improvement to consider - add a test that specifically validates the regex pattern (when implemented) against more diverse and complex filenames that might appear in production.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 684b8b1 and 5c4d580.

📒 Files selected for processing (1)
  • src/parseable/streams.rs (5 hunks)
🔇 Additional comments (2)
src/parseable/streams.rs (2)

197-200: LGTM! Improved code maintainability with constant usage

Using the constant ARROW_FILE_EXTENSION instead of a hardcoded string improves maintainability and consistency.


243-250: Improved error handling for unexpected arrow files

The added else branch properly logs a warning when an unexpected arrow file is encountered, enhancing error visibility.

coderabbitai[bot]
coderabbitai bot previously approved these changes Mar 21, 2025
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
src/parseable/streams.rs (1)

248-249: Warn-only approach for unexpected Arrow files.
The code logs a warning for files that don’t match arrow_path_to_parquet. If unrecoverable conditions are possible, consider surfacing them with a stronger error or by removing the file.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5c4d580 and 1d81762.

📒 Files selected for processing (1)
  • src/parseable/streams.rs (5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: coverage
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
🔇 Additional comments (7)
src/parseable/streams.rs (7)

197-200: Conditional check looks good.
Using is_some_and here is concise. Just confirm your Rust version is ≥1.63 if you rely on this method.


243-244: Good handling when the path conversion succeeds.
This code cleanly associates valid Parquet keys with their Arrow files.


1215-1222: Helper function for test file creation is clear.
The dummy content writing approach is straightforward and fits well with test needs.


1224-1239: Valid Arrow path test.
This test verifies that the function returns a correct .parquet filename. Looks good.


1241-1252: Potential mismatch with function logic.
This test expects None for missing .data.arrows suffix, but the function only checks for the .arrows extension and a split with .. Verify that this test can pass with the current implementation.


1254-1265: Invalid schema key scenario may not return None.
The function does not reject special characters in front parts. Please verify if the test passes under real usage.


1267-1290: Complex path handling test is comprehensive.
It validates deep directory structures and multiple delimiters. This aligns well with your function’s current logic.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
src/parseable/streams.rs (1)

70-80: 🛠️ Refactor suggestion

Possible panic with assert! contradicts Option return type pattern

The function correctly uses ? operators to propagate errors via Option, but then uses assert! on line 74 which would cause a panic if front doesn't contain a dot. This contradicts the fallible nature of the function and the pattern established with the return type.

The panic condition would interrupt normal program flow rather than allowing the caller to handle the error case gracefully through the Option return type.

Replace the assertion with a fallible check:

fn arrow_path_to_parquet(path: &Path, random_string: &str) -> Option<PathBuf> {
    let filename = path.file_stem()?.to_str()?;
    let (_, front) = filename.split_once('.')?;
-    assert!(front.contains('.'), "contains the delim `.`");
+    if !front.contains('.') {
+        return None;
+    }
    let filename_with_random_number = format!("{front}.{random_string}.parquet");
    let mut parquet_path = path.to_owned();
    parquet_path.set_file_name(filename_with_random_number);

    Some(parquet_path)
}
🧹 Nitpick comments (1)
src/parseable/streams.rs (1)

1241-1264: Test uses complex path but doesn't verify full path correctness

The test validates only the filename portion of the result, but doesn't verify that the directory structure is preserved correctly.

Add an assertion to verify the directory structure is preserved:

assert!(result.is_some());
let parquet_path = result.unwrap();
assert_eq!(
    parquet_path.file_name().unwrap().to_str().unwrap(),
    "date=2020-01-21.hour=10.minute=30.region=us-west.ee529ffc8e76.data.random456.parquet"
);
+// Verify that the directory structure is preserved
+assert_eq!(
+    parquet_path.parent().unwrap(),
+    nested_dir
+);
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9a5df0c and d97723e.

📒 Files selected for processing (1)
  • src/parseable/streams.rs (5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: coverage
🔇 Additional comments (3)
src/parseable/streams.rs (3)

197-200: Good use of constant instead of hardcoded extension

The change to use the ARROW_FILE_EXTENSION constant instead of a hardcoded string value improves code maintainability.


243-250: Improved error handling with warning for unexpected arrow files

Adding a warning log when encountering unexpected arrow files increases robustness and visibility of potential issues.


1215-1222: Well-structured test helper function

The create_test_file function is a good abstraction for test setup, making the test cases cleaner and more maintainable.

Comment on lines +1224 to +1239
#[test]
fn test_valid_arrow_path_conversion() {
let temp_dir = TempDir::new().expect("Failed to create temp dir");
let filename = "12345abcde&key1=value1.date=2020-01-21.hour=10.minute=30.key1=value1.key2=value2.ee529ffc8e76.data.arrows";
let file_path = create_test_file(&temp_dir, filename);
let random_string = "random123";

let result = arrow_path_to_parquet(&file_path, random_string);

assert!(result.is_some());
let parquet_path = result.unwrap();
assert_eq!(
parquet_path.file_name().unwrap().to_str().unwrap(),
"date=2020-01-21.hour=10.minute=30.key1=value1.key2=value2.ee529ffc8e76.data.random123.parquet"
);
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Test doesn't check filename structure expectation properly

The expected output filename in test_valid_arrow_path_conversion is inconsistent with the current implementation of arrow_path_to_parquet. The test assumes a format that includes the "data" component which would be created by format!("{front}.data.{random_string}.parquet"), but the implementation uses format!("{front}.{random_string}.parquet").

Either update the test expectation or change the implementation to match. If you change the implementation, be sure to update line 75 as suggested above.

Comment on lines +1224 to +1264
#[test]
fn test_valid_arrow_path_conversion() {
let temp_dir = TempDir::new().expect("Failed to create temp dir");
let filename = "12345abcde&key1=value1.date=2020-01-21.hour=10.minute=30.key1=value1.key2=value2.ee529ffc8e76.data.arrows";
let file_path = create_test_file(&temp_dir, filename);
let random_string = "random123";

let result = arrow_path_to_parquet(&file_path, random_string);

assert!(result.is_some());
let parquet_path = result.unwrap();
assert_eq!(
parquet_path.file_name().unwrap().to_str().unwrap(),
"date=2020-01-21.hour=10.minute=30.key1=value1.key2=value2.ee529ffc8e76.data.random123.parquet"
);
}

#[test]
fn test_complex_path() {
let temp_dir = TempDir::new().expect("Failed to create temp dir");
let nested_dir = temp_dir.path().join("nested/directory/structure");
std::fs::create_dir_all(&nested_dir).expect("Failed to create nested directories");

let filename = "20200201T1830f8a5fc1edc567d56&key1=value1&key2=value2.date=2020-01-21.hour=10.minute=30.region=us-west.ee529ffc8e76.data.arrows";
let file_path = nested_dir.join(filename);

let mut file = File::create(&file_path).expect("Failed to create test file");
file.write_all(b"test content")
.expect("Failed to write to test file");

let random_string = "random456";

let result = arrow_path_to_parquet(&file_path, random_string);

assert!(result.is_some());
let parquet_path = result.unwrap();
assert_eq!(
parquet_path.file_name().unwrap().to_str().unwrap(),
"date=2020-01-21.hour=10.minute=30.region=us-west.ee529ffc8e76.data.random456.parquet"
);
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add negative test cases for arrow_path_to_parquet function

While you've added tests for valid paths, there are no tests for invalid paths where the function should return None.

Add tests for cases where the function should return None:

#[test]
fn test_invalid_arrow_path_without_extension() {
    let temp_dir = TempDir::new().expect("Failed to create temp dir");
    let filename = "invalid_filename_without_extension";
    let file_path = create_test_file(&temp_dir, filename);
    let random_string = "random123";

    let result = arrow_path_to_parquet(&file_path, random_string);
    assert!(result.is_none());
}

#[test]
fn test_invalid_arrow_path_without_delimiters() {
    let temp_dir = TempDir::new().expect("Failed to create temp dir");
    let filename = "invalid_filename.arrows";
    let file_path = create_test_file(&temp_dir, filename);
    let random_string = "random123";

    let result = arrow_path_to_parquet(&file_path, random_string);
    assert!(result.is_none());
}

#[test]
fn test_invalid_arrow_path_without_second_delimiter() {
    let temp_dir = TempDir::new().expect("Failed to create temp dir");
    let filename = "schema.nodelimiter.arrows";
    let file_path = create_test_file(&temp_dir, filename);
    let random_string = "random123";

    // This would fail with the current implementation due to the assert!
    // but should return None if the assert is replaced with a check
    let result = arrow_path_to_parquet(&file_path, random_string);
    assert!(result.is_none());
}

Comment on lines +70 to +80
/// Returns the filename for parquet if provided arrows file path is valid as per our expectation
fn arrow_path_to_parquet(path: &Path, random_string: &str) -> Option<PathBuf> {
let filename = path.file_stem()?.to_str()?;
let (_, front) = filename.split_once('.')?;
assert!(front.contains('.'), "contains the delim `.`");
let filename_with_random_number = format!("{front}.{random_string}.parquet");
let mut parquet_path = path.to_owned();
parquet_path.set_file_name(filename_with_random_number);

Some(parquet_path)
}
Copy link

@coderabbitai coderabbitai bot Mar 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Format string doesn't match expected parquet file format

The format string used for constructing the parquet filename on line 75 doesn't include the 'data' portion that's present in the arrow filename, but the tests expect it to be included.

Update the format string to maintain the same structure:

fn arrow_path_to_parquet(path: &Path, random_string: &str) -> Option<PathBuf> {
    let filename = path.file_stem()?.to_str()?;
    let (_, front) = filename.split_once('.')?;
    assert!(front.contains('.'), "contains the delim `.`");
-    let filename_with_random_number = format!("{front}.{random_string}.parquet");
+    let filename_with_random_number = format!("{front}.data.{random_string}.parquet");
    let mut parquet_path = path.to_owned();
    parquet_path.set_file_name(filename_with_random_number);

    Some(parquet_path)
}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
/// Returns the filename for parquet if provided arrows file path is valid as per our expectation
fn arrow_path_to_parquet(path: &Path, random_string: &str) -> Option<PathBuf> {
let filename = path.file_stem()?.to_str()?;
let (_, front) = filename.split_once('.')?;
assert!(front.contains('.'), "contains the delim `.`");
let filename_with_random_number = format!("{front}.{random_string}.parquet");
let mut parquet_path = path.to_owned();
parquet_path.set_file_name(filename_with_random_number);
Some(parquet_path)
}
/// Returns the filename for parquet if provided arrows file path is valid as per our expectation
fn arrow_path_to_parquet(path: &Path, random_string: &str) -> Option<PathBuf> {
let filename = path.file_stem()?.to_str()?;
let (_, front) = filename.split_once('.')?;
assert!(front.contains('.'), "contains the delim `.`");
let filename_with_random_number = format!("{front}.data.{random_string}.parquet");
let mut parquet_path = path.to_owned();
parquet_path.set_file_name(filename_with_random_number);
Some(parquet_path)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

front already carries ".data" at the end as asserted above

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@de-sh You're absolutely right. After tracing through the function with examples from the tests, I can see that "front" already contains ".data" at the end.

For example, when processing a file named "abc.date=2023-10-01.hour=12.minute=30.hostname.data.arrows":

  1. file_stem → "abc.date=2023-10-01.hour=12.minute=30.hostname.data"
  2. After split_once('.') → front = "date=2023-10-01.hour=12.minute=30.hostname.data"
  3. Result with format → "date=2023-10-01.hour=12.minute=30.hostname.data.{random_string}.parquet"

The original implementation is correct. Thanks for the clarification!


✏️ Learnings added
Learnt from: de-sh
PR: parseablehq/parseable#1239
File: src/parseable/streams.rs:70-80
Timestamp: 2025-03-24T06:08:47.139Z
Learning: In the Parseable codebase, arrow filenames follow a structure ending with ".data.arrows". When converting to parquet, the "front" part (after the first dot) already includes ".data" at the end, so there's no need to add it explicitly in the arrow_path_to_parquet function.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Copy link
Contributor

@nikhilsinhaparseable nikhilsinhaparseable left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good to merge

@nitisht nitisht merged commit d9da876 into parseablehq:main Mar 24, 2025
14 checks passed
@de-sh de-sh deleted the regex branch March 24, 2025 06:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants